ABSTRACT

The advisability of drafting guidelines for 
evaluating statistical software is considered. The Committee on 
Applied and Theoretical Statistics of the National Research Council 
has decided to initiate a project to articulate issues relating to
guidelines and to determine their priorities. Because there has been 
a proliferation in statistical software, more statistical work is 
being done by people with little or no training in statistics, a fact 
that makes guidelines increasingly important. Benefits of the 
proposed guidelines for users, consumers, producers, and the 
scientific community are considered. The aspects of statistical 
packages that require guidelines include: (1) coverage; (2) numerical 
accuracy; (3) graphics; (4) data retrieval and data manipulation; (5) 
data transfer; (6) documentation; (7) user interface; (8) device 
interfaces; (9) speed; and (10) extensibility. Guidelines have at 
least four important purposes: to provide a commonality to enable 
consumers to share knowledge; to help codify existing knowledge; to
raise consumer expectations; and to provide some long-term stability
in a rapidly changing environment. (SLD) 
1. Introduction

The Committee on Applied and Theo- 
retical Statistics (CATS), National Research 
Council (NRC), is a standing committee of 
the NRC Board on Mathematical Sciences 
(BMS) designed to monitor research trends 
and issues affecting the statistical sciences. 
Dr. Eddy is the Chair of CATS; Dr. Cox 
directs the BMS. The issue of guidelines for 
statistical software is broad: on the one hand, 
it affects all applications of statistics and the 
perceptions about statistics of users and oth- 
ers outside the field; on the other hand, it is 
a problem that currently lacks a locus of re- 
sponsibility and action for its solution. CATS 
has decided to initiate a project to articulate 
and prioritize issues in the guidelines area, in 
an attempt to bring the problems closer to the 
attention and action agenda of other groups. 

2. Summary of the Issues 
2.1. The Past

Since 1970, and particularly since 1980 
with the advent of inexpensive, powerful per- 
sonal computing, the use of statistical soft- 
ware packages has proliferated. The out- 
puts of these packages are used to justify and 
evaluate important decisions affecting seg- 
ments of national and world populations and 
economies. As these packages have become 
more familiar, they are used by a wider and 
increasingly more statistically naive class of 
user, with the result that evaluation of out- 
puts and appropriateness of use are less fre- 
quently questioned. 

Serious quality and reliability issues in statistical
software have been documented, including: inconsistent
outputs (even by major packages) to identical simple
queries, inappropriate algorithms, incomplete or
inconsistent repetitions of procedures, poor
documentation, numerical instability, and lack of
adherence to statistical standards. The ef- 
fects of these defects are multiplied as statis- 
tical software becomes more widely used for 
routine purposes in business, science, and en- 
gineering. The potential repercussions are es- 
pecially alarming when one realizes that un- 
questioned outputs are being routinely ac- 
cepted as reliable for purposes affecting deci- 
sions of national importance. As the software 
increasingly becomes a surrogate for the ex- 
pert statistician, human abilities to monitor 
performance decrease as reliance on outputs 
increases. 

In the middle 1970's there were at most 
20 commercial statistical software packages. 
Commonly called statistical packages, they 
are computer software, each of which consists 
of a collection of computer programs that per- 
forms statistical computations and, in some 
instances, presents analyses of the compu- 
tations. These packages were, almost with- 
out exception, originally developed at lead- 
ing research universities. The total number 
of users was small and the users were gen- 
erally quite skilled, both in statistics and in 
computing. Most users submitted batch pro- 
grams on punched cards to IBM mainframe 
computers and received support from the lo- 
cal computer center and often directly from 
the package developers. Even at that time 
there was great concern in the statistical com- 
munity about the quality and transportability 
of statistical software (Francis and Heiberger, 
1975). 



By the late 1970's a major proliferation 
of statistical software had begun. Minicom- 
puters and superminicomputers had become 
available to smaller research groups. The 
number of packages was increasing. Time- 
shared interactive computing was becoming 
commonplace in research environments. 

2.2. Today 

Today it is possible for anyone with a per- 
sonal computer and a language processor to 
create a piece of software and call it a sta- 
tistical package. There is great demand for 
statistical software because decision makers 
increasingly require quantitative information 
on which to make and defend key decisions. 
Often, decisions based on a statistical analysis 
(however flawed) are more readily accepted. 
There is a shortage of expert statisticians for 
these purposes and such expertise is expensive 
to develop and maintain. A statistical pack- 
age is often an attractive alternative, becom- 
ing a pseudo-statistician. This has led to bur- 
geoning growth and the proliferation of both 
good and bad statistical software. 

Today numerous commercial and non- 
commercial statistical software packages are 
available. A conservative estimate is that 
there are 300 commercial packages available 
and many more packages under development. 
It is difficult to determine the number of pack-
ages that are distributed free-of-charge to the
user but the count is similar. Statisticians re- 
main deeply concerned about the quality of 
the statistical software they and others use 
(Schervish, 1988). 

Because the supply of statistical software 
packages and computing power is growing 
quickly and will continue to be less expen- 
sive than expert statistical advice, more and 
more statistical work will be done by people 
with little or no training in statistics using 
statistical software packages of increasing "so- 
phistication." While this has the great ad- 
vantage of satisfying manpower needs with 
fewer highly trained statisticians, it has the 
great disadvantage that these untrained users
will be unable to evaluate the quality of the
program's performance and output. Key na- 
tional, business, and social decisions will be 
made in this environment. What is needed 
to reverse this declining trend are guidelines 
based upon established statistical principles 
for benchmarking and assessing the perfor- 
mance and output of statistical software pack- 
ages. The increased utilization of supercom- 
puters and other high performance computers 
has emphasized the need for such guidelines. 
The issue here is not whether these guidelines 
will eventually lead to standards for statisti-
cal software or even whether such standards
are desirable. (The difference between guide-
lines and standards is discussed in Section 6.)
Rather, it is necessary to provide informa-
tion and a framework to ensure minimal ac-
ceptable statistical functionality and ease of
interpretation and use in statistical software 
and basic tools to verify that the guidelines 
have been met. We anticipate that guidelines 
and benchmarks as proposed will be invalu- 
able to users and useful to developers of sta- 
tistical software. This will serve to eliminate a 
well-defined set of problems in many software 
packages at the development or enhancement 
stage, thereby reducing greatly at least one 
kind of misuse of statistics. 

3. Need 

Wilkinson and Dallal (1977) reported the
results of calculating the means, standard de- 
viations, and correlation of three artificial ob- 
servations on three variables using four sta- 
tistical packages. Only one of the four pack- 
ages calculated all of the sample statistics cor- 
rectly. This test case demonstrates clearly 
the need for improvements in software quality, 
even in the most trivial calculations, as well as 
guidelines and tools for evaluating statistical 
software. 
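
The kind of failure reported above can be illustrated with a minimal sketch in Python. The data are hypothetical (they are not the Wilkinson and Dallal observations) and the function names are ours. The sketch contrasts the one-pass "calculator" formula for the sample variance, which subtracts two nearly equal large numbers, with the two-pass formula, which centers the data before squaring.

```python
# Hypothetical test data: three observations with a large common offset.
HYPOTHETICAL_DATA = [1000000001.0, 1000000002.0, 1000000003.0]


def naive_variance(xs):
    """One-pass formula: (sum of squares - (sum)^2 / n) / (n - 1).

    Prone to catastrophic cancellation when the mean is large
    relative to the spread of the data.
    """
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(xs):
    """Two-pass formula: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


if __name__ == "__main__":
    print("one-pass:", naive_variance(HYPOTHETICAL_DATA))
    print("two-pass:", two_pass_variance(HYPOTHETICAL_DATA))
```

The true sample variance of these data is 1. In double-precision arithmetic the two-pass formula recovers it exactly, whereas the one-pass formula does not, because the squared observations are too large to be represented exactly.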

Teitel (1981) constructed two data sets 
that were nonrectangular but still had simple 
relationships between the data records. He 
then invited the distributors of several widely 
used statistical packages to perform simple 
counts from the data. The results, reported in
companion papers, are difficult to summarize 
simply, but it appears that no two packages 
got the same counts. These results are most 
distressing. 

A complicating aspect of statistical soft- 
ware packages is that they are used by 
the most sophisticated researchers (Eddy
et al., 1986) and by the most statistically
naive laypersons. Therefore, these packages
must have a broader range of capabilities 
than other kinds of scientific software, and 
must be integrated vertically so that anal- 
yses which should be conforming — such as 
a detailed analysis by the expert statisti- 
cian and a broader confirming analysis by 
management — are not divergent. Guidelines 
and baseline test data for users and producers 
will help span this range smoothly. 

4. Benefits 

4.1. Benefits to Users 

The most obvious beneficiaries of the
planned set of guidelines will be the users (i.e., 
purchasers and end-users) of the statistical 
software. They will have available an objec- 
tive set of guidelines for evaluation of statisti- 
cal software. This will allow prospective pur- 
chasers to make objective comparisons based 
on criteria that have been selected by experts 
(as opposed to the haphazard evaluation cri-
teria that are often used now). This will allow
end-users to ascertain whether aspects of the 
software which they do not fully understand 
are as trustworthy (based on criteria deter- 
mined by experts) as the aspects that they 
do fully understand. The statistical software 
industry output is estimated at $250 million 
per year. Nearly all of these purchasing deci- 
sions have heretofore been made without the 
benefit of objective evaluation.

4.2. Benefits to Consumers

The largest group of beneficiaries of the 
planned set of guidelines are the consumers of 
the results produced by the software. Their
benefit is indirect in two ways. First, they do
not actually use the software but rather just 
the output. Second, the consumers may not 
even have the advantage of knowing that the 
software that produced the results has been 
evaluated using guidelines and benchmarks. 
In any case, these consumers will have the 
advantage that their results are based on soft- 
ware that is acceptable to the community of 
statistical experts (if that is the case). 

4.3. Benefits to Producers 

The smallest group of beneficiaries of the 
planned set of guidelines, but certainly the 
most important, are the producers of the soft-
ware. They will benefit in two direct ways.
First, they will have the opportunity to com- 
pare their software to other software based 
on an objective set of guidelines and bench- 
marks. Currently, the only comparisons are 
generated by individual commercial produc- 
ers and there is an attendant suspicion of bias 
in the selection of test cases. Second, pro- 
ducers of software, by having test cases and 
guidelines for evaluation, will have the oppor-
tunity to raise the quality of their software.
We expect that the user and ultimately the 
producer community will come to the point 
where software will be unacceptable unless it 
can meet the guidelines that are planned. 

4.4. Benefits to the Scientific Community 

In a recent article, Molenaar (1989) wrote, 
"The rapid growth of data analysis by non- 
experts has aggravated the problem that the 
scientific community lacks a suitable system 
for assessing and controlling the quality of sta- 
tistical software." He went on to discuss the 
costs to the scientific community of bad statis- 
tical software and to argue that the commu- 
nity should modify the system of rewards and 
punishments to draw attention to the prob- 
lem. Molenaar (1989) and Victor (1985) be- 
fore him obviously felt that information about 
the quality of software must be made available 
to the larger scientific community, not just the 
statisticians, in order to effect the desired im-
provements.

In the final analysis, the most important 
benefit is, naturally, the confidence that fu- 
ture research and applications will not be dan- 
gerously flawed through the naive and unwit- 
ting use of poor, inaccurate or incorrect sta- 
tistical packages. 

5. Aspects Needing Guidelines 

There are a variety of different aspects 
or components of statistical software that 
require guidelines for evaluation and use. 
The only previous systematic effort aimed at 
evaluating the quality of statistical software 
(Francis, Heiberger, and Velleman, 1975) has 
provided some background on the appropri- 
ate aspects. However, as computing environ- 
ments have changed, so have the important 
aspects. A list of those aspects of statisti- 
cal software that require evaluation would in- 
clude at least the following topics. 

5.1. Coverage 

What statistical procedures does the pack-
age include? The procedures which are avail-
able and their implementation limit the qual-
ity of the user's results as well as the ease with
which they are produced. Commonly used 
procedures include descriptive statistics (both 
numerical and graphical), random number 
generation, confidence intervals and tests, re- 
gression analysis, correlation and linear model 
fitting, and categorical data analysis. Nonlin- 
ear model fitting is important in the physical 
sciences. Guidelines as to what are the best 
procedures known to date that make possible 
good statistical practice are needed. 

Some software packages are more com- 
prehensive or are specialized for particular 
kinds of problems. In such cases too, guide- 
lines as to what procedures are necessary for 
good modern statistical practice are needed 
by consumers. Included in a list of more ad- 
vanced/specialized topics would be: time se- 
ries analysis, loglinear models (and extensive 
cross-tabulation), survival analysis, nonpara-
metric smoothing, linear algebra, construc-
tion and analysis of complex experimental de-
signs, dose-response modeling, and dynamic
data display.

5.2. Numerical Accuracy

Both the common mathematical functions 
and the algorithms used to implement statis- 
tical procedures must exhibit acceptable nu- 
merical quality. There already exist hard- 
ware standards such as ANSI/IEEE 754-1985 
for floating point arithmetic and most hard- 
ware manufacturers adhere to those stan- 
dards. These standards provide a foundation 
on which to build statistical software which 
will have the highest possible numerical ac- 
curacy. There already exist a few test data 
sets and some unsupported recommendations 
for numerical accuracy of statistical software. 
The project described here will result in a 
systematic body of data and some system- 
atic guidelines for evaluation of this aspect. It 
is actually becoming more important to have 
guidelines available with the increased use of 
supercomputers and other high performance 
computers, as their added speed allows users 
to do more computations on larger data sets 
than in the past. 
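
One way such guidelines might score numerical accuracy is by the approximate number of correct significant digits, taken as the negative base-10 logarithm of the relative error against a certified reference value. The sketch below is an assumption on our part, not a measure the project has adopted; the function name and any values passed to it are hypothetical.

```python
import math


def correct_digits(computed, certified):
    """Approximate number of correct significant digits in `computed`.

    Uses -log10 of the relative error. When the certified value is
    zero the relative error is undefined, so the absolute error is
    used instead.
    """
    if computed == certified:
        return math.inf  # exact agreement
    if certified == 0.0:
        return -math.log10(abs(computed))
    return -math.log10(abs(computed - certified) / abs(certified))


if __name__ == "__main__":
    # A result of 1.001 against a certified value of 1.0 agrees
    # to roughly three significant digits.
    print(correct_digits(1.001, 1.0))
```

A benchmark suite could report this score for each statistic and each test data set, so that packages are compared on a common scale rather than on a pass/fail basis.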

5.3. Graphics 

Computer graphics is fast becoming an es- 
sential component of statistical data analy- 
sis. Because of the larger size of data sets 
and the increased complexity of the kinds 
of analyses which are done using high per- 
formance computers, graphics has become a 
crucial analytical tool. Research on graphi- 
cal perception demonstrates clearly that the 
eye-brain system performs certain tasks re- 
quired to decode statistical data represented 
graphically better than other tasks (indeed, 
such tasks can be arranged hierarchically). 
The common means of representing statisti- 
cal data graphically give rise systematically 
to perceptual distortions (e.g., use of areas to
represent linear variables such as in statisti-
cal maps and improper uses of color) (Cleve-
land, 1985). Statistical data are typically
multidimensional and efficient, comprehensi- 
ble means of representing multidimensional 
data are needed. The need for guidelines for 
statistical graphics is critical, particularly be- 
cause non-scientific users of graphics tend to 
emphasize the ability of graphics to capture
the reader's attention with less regard for the 
accuracy of the message received. 
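
The area distortion mentioned above can be made concrete. If a value is drawn as a circle whose radius is proportional to the value, the area grows as the square of the value and exaggerates differences; scaling the radius by the square root keeps the area proportional to the value. The following Python sketch uses hypothetical function names and illustrates only the arithmetic, not a recommended display method.

```python
import math


def radius_area_proportional(value, unit_radius=1.0):
    """Radius that makes the circle's AREA proportional to `value`."""
    return unit_radius * math.sqrt(value)


def area_ratio_if_radius_linear(value_a, value_b):
    """Ratio of areas seen by the reader when the RADIUS is
    (incorrectly) made proportional to the value."""
    return (value_a / value_b) ** 2


if __name__ == "__main__":
    # A value three times as large is drawn with nine times the area
    # when the radius, rather than the area, is scaled linearly.
    print(area_ratio_if_radius_linear(3.0, 1.0))
    print(radius_area_proportional(4.0))
```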

5.4. Data Retrieval and Manipulation 

Teitel (1981) uncovered enormous discrep-
ancies between major commercial packages
in handling and making counts and infer- 
ences from data sets of only moderate com- 
plexity, and proposed some fairly simple but 
quite revealing data sets to test the database 
management capabilities of statistical soft- 
ware systems. It is advisable to go fur- 
ther and consider searching, splitting, ag- 
gregating, transforming, handling of char- 
acter as well as numeric data, handling of 
missing values, sorting, array construction 
and manipulation. Statistical databases dif- 
fer from traditional transaction databases in 
that records typically are processed in the 
attribute domain rather than the record do- 
main. Paradigms leading to models of effi- 
cient statistical databases are needed. Per- 
formance guidelines are needed to construct 
and evaluate such paradigms. 
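
To see why simple counts can disagree, consider a small nonrectangular data set in which each household record owns a variable number of person records. The records and function names below are hypothetical (they are not Teitel's test data). The sketch shows that "how many?" has several defensible answers, depending on the unit counted and on how missing values are treated.

```python
# Hypothetical hierarchical (nonrectangular) data: households own persons.
HOUSEHOLDS = [
    {"id": 1, "persons": [{"age": 34}, {"age": None}]},  # one age is missing
    {"id": 2, "persons": []},                            # no person records
    {"id": 3, "persons": [{"age": 71}]},
]


def count_households(households):
    """Count every household record, occupied or not."""
    return len(households)


def count_occupied_households(households):
    """Count households that own at least one person record."""
    return sum(1 for h in households if h["persons"])


def count_persons(households):
    """Count person records, including those with missing ages."""
    return sum(len(h["persons"]) for h in households)


def count_persons_with_known_age(households):
    """Count person records whose age is not missing."""
    return sum(
        1 for h in households for p in h["persons"] if p["age"] is not None
    )


if __name__ == "__main__":
    print(count_households(HOUSEHOLDS))              # 3
    print(count_occupied_households(HOUSEHOLDS))     # 2
    print(count_persons(HOUSEHOLDS))                 # 3
    print(count_persons_with_known_age(HOUSEHOLDS))  # 2
```

Two packages that silently choose different conventions will report different counts from the same file, which is why test data sets of this kind are revealing.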

5.5. Data Transfer 

One critical feature of a program in a sta- 
tistical package is its ability to read data from 
and write data to the world outside the pro-
gram. This can be as simple as reading an
ASCII text file (a feature missing from some
programs) and as complex as transferring an
internal data structure from the format of one
program to the format of another. The im-
portance of this feature can be inferred from
the fact that there exist commercial programs 
which provide only this capability. 
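
Even the simple case of a text file can lose information. As a sketch, assuming values are exchanged as decimal text, the Python fragment below compares a fixed six-significant-digit format with Python's shortest round-trip representation of a floating-point number. The function names are hypothetical.

```python
def write_fixed(value, digits=6):
    """Format a number with a fixed count of significant digits."""
    return f"{value:.{digits}g}"


def write_round_trip(value):
    """Format a number so that reading it back yields the same float."""
    return repr(value)


def survives_transfer(value, writer):
    """True if writing `value` as text and reading it back is lossless."""
    return float(writer(value)) == value


if __name__ == "__main__":
    x = 0.1 + 0.2  # not exactly 0.3 in binary floating point
    print(survives_transfer(x, write_fixed))       # False
    print(survives_transfer(x, write_round_trip))  # True
```

A guideline for data transfer might therefore ask not only whether a package can read and write text files, but whether a value written by the package is recovered unchanged when read back.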



5.6. Documentation 

Muller (1978) (also Berk and Francis,
1978) has provided a set of objectives for the 
evaluation of statistical software documenta- 
tion. 

"A user guide should make it possible for 
a reader to determine: 

• what capabilities are provided

• how they are achieved for the user 

• how they are presented in the user guide 

• what types of data are acceptable 

• what types of input are used to create 
the output 

• how the statistical capabilities should be 
used 

• what constraints or limitations must be 
observed 

• how required resources should be esti- 
mated." 

5.7. User Interface 
Issues include:

• clarity and ease of use (Thisted, 1976) 

• diaries and histories

• commands or menus (or both) 

• output format 

• error handling and messages (Eddy 
1981). 

5.8. Device Interfaces 

As hardware environments have become 
more diverse, it has become more difficult to 
find statistical program packages which sup- 
port the variety of output devices available. 
The possible devices include printers, plot- 
ters, and multi-window workstations. There 
are also a multitude of more exotic devices for
which interfaces would be useful, for exam-
ple, film recorders. Some statistical packages,
especially those which are oriented towards 
graphics, provide a wide variety of interfaces; 
other packages provide only line printer out- 
put. 

5.9. Speed 

As the power of computer hardware in- 
creases, concern about the speed with which 
a particular calculation is performed within a 
statistical package diminishes. Nevertheless, 
the time a package takes to perform a partic- 
ular procedure is important to the user and 
can often provide clues to the overall quality 
of the package. Test data sets which can serve
as a basis for comparing the speed of statisti-
cal packages would be helpful.
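
A speed comparison of this kind might be organized as in the following sketch, which assumes that each package procedure can be called as a Python function on a common test data set. The harness name, the stand-in procedure, and the data are hypothetical. The best of several repetitions is reported to reduce the influence of other activity on the machine.

```python
import time


def best_time(procedure, data, repeats=5):
    """Return the shortest wall-clock time (seconds) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        procedure(data)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def sample_mean(data):
    """Stand-in for a package procedure under test."""
    return sum(data) / len(data)


if __name__ == "__main__":
    test_data = [float(i) for i in range(100000)]
    print(f"best of 5: {best_time(sample_mean, test_data):.6f} s")
```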

5.10. Extensibility 

A modern statistical package provides its
own language for performing statistical cal- 
culations. The simplest of these merely al- 
lows for the creation of temporary variables 
for storage of intermediate calculations. The 
more sophisticated ones have many of the 
capabilities of more traditional programming 
languages; for example, loops, branches, sub- 
routines, recursion, complex data structures, 
etc. Ling (1980) has provided a preliminary 
sketch of what a user should expect. 

6. What Are Guidelines? 

In order to appreciate the value of guide- 
lines (for evaluating statistical software) it is 
important to understand the difference be- 
tween guidelines and standards. Generally, 
standards fall into two broad categories: 

1. those that are legislated; and 

2. those that are accepted. 

We begin by considering the latter. 

6.1. De Facto Standards 



There is, of course, no black and white 
here. Government contractual requirements 
which force software to be SVID-compliant 
are much closer to the legislative end of the 
gray area. Standards making bodies such as 
ANSI and ISO operate near the opposite end 
of the gray area. De facto standards are, in 
many senses, the best possible kind. There 
are few, if any, significant social or economic 
dislocations caused by the standards. They 
provide a generally higher level of functional- 
ity. 

Take, as an example, the lids on the metal
cans that various processed foods are sold in. 
We always refer to these as "tin" cans al- 
though they probably haven't had much tin 
in them for years. As far as we know ev- 
ery can opener will open every can. This is 
a remarkable degree of standardization and a 
study of how it came to be would probably be 
very instructive for the standard makers. We 
should admit parenthetically that there are 
some specialized can openers which are not so 
ubiquitous. The kind of opener that is used to 
punch holes in the lid so that liquids can be 
poured out (beer drinkers usually call these 
"church keys") can only be used to remove 
solid food in an act of desperation well-known 
to campers. The resulting many-pointed star 
is one of the most dangerous parts of a camp- 
ing trip. The kind of opener often delivered 
with a can of sardines cannot be used for 
opening cans other than the special type that 
it was designed for. However, the can can 
(we apologize but couldn't resist) always be 
opened by a more traditional can opener; we 
know this from many experiences in which the 
key-like opener delivered with the sardine can
failed to open it. 

Another example that leaps to mind is au- 
tomobile transmissions. These are classified 
into two large categories: 

1. automatic transmissions; and 

2. manual transmissions 

and crossed with this classification are two 
other large categories: 
1. steering-column shifter; and

2. floor-mounted shifter. 

There do exist "semi-automatic" transmis- 
sions, although we haven't seen a recently- 
manufactured one. And within each of the 
two categories there are a number of minor 
variations. But, for example, the gear pat-
tern is generally a straight line for the au-
tomatics in the order P R N D L. For the
manual transmission there are a wider variety 
of patterns but the basic H-pattern seems ex- 
tremely wide-spread. The importance of this 
is that (once you have learned where R
is on the manual transmission) you can drive
any car in the world. You cannot, however,
drive an 18-wheel tractor-trailer combination.
Again it would be instructive to understand 
how the automobile transmission came to be 
standardized. When was the last time that 
you saw an automobile with an automatic 
transmission operated by push buttons on the 
dashboard? 

6.2. De Jure Standards

These standards usually have the force of 
law behind them although one could argue 
that some ANSI standards are in this cat- 
egory. One of the most important exam- 
ples in this class is electric receptacles in the 
United States. The first and most impor- 
tant point is that no matter where you go 
in the United States the receptacle tells you 
what sort of electricity will be delivered. And 
to focus sharply on standard home 15 amp 
120V AC receptacles, every plug will go into 
every receptacle (which meets the standard). 
Like any good standard the National Electrical
Code (NEC) is evolving and modern polarized 
plugs will not always go into the receptacles 
installed thirty years ago; this is a genuine 
safety feature. However, thirty-year old plugs 
will go into modern receptacles. Another di- 
mension of compatibility concerns higher am- 
perage receptacles. A plug designed to deliver 
between 15 and 20 amps will not fit into a 15 
amp receptacle. However, a plug designed to
deliver less than 15 amps will fit into a 20
amp receptacle. So the standard preserves a 
specific kind of compatibility across time and 
size; the specific compatibility was carefully 
planned to increase safety for the user. 

How is the NEC enforced? Unfortunately, 
electric codes are deemed the province of
local governments. So, each locality has (if 
it chooses, and most so choose because of 
the additional government jobs that are au- 
tomatically created) its own inspectors to en- 
force the code. The reason we wrote "un- 
fortunately" is that not every locality adopts 
the NEC; many write in their own variations 
on the NEC. Occasionally these variations are
of good intent, reflecting some minor correc-
tion to the NEC (and such variations are of-
ten adopted into the NEC in its next revi-
sion). More often, however, these variations
are adopted for local political/economic rea-
sons. We are thinking, for example, of re-
quiring the use of BX cable instead of the
more widespread NM. BX is more expensive 
so, we presume, the cable distributor makes a 
larger profit. Or the details of some installa- 
tion procedure may be varied so that it takes 
more time for the electrical contractor to per- 
form a specific installation, again generating 
a greater income. 

The situation with the NEC should be 
compared with other countries. A traveler 
to Europe (including Great Britain) must be
prepared to deal with two different voltages
(120V and 240V) and about five different
sorts of receptacles; often there are several
kinds in one country. And we have even seen
cases where there were three different-shaped
receptacles in one room; pity the poor folks
who have only one kind of receptacle and
an appliance with an incompatible plug.

7. What Good Are Guidelines? 

First, we must understand the difference 
between guidelines and standards. A set of 
guidelines could be a preliminary standard, a 
sort of draft standard waiting to be adopted. 
A set of guidelines could be a standard that 
failed to be adopted; a rule with no teeth. A
set of guidelines could be advice to consumers. 

Guidelines can serve at least four impor-
tant and useful purposes:

1. Guidelines can provide a commonality
that enables consumers to share knowl-
edge. Universality is particularly impor-
tant in complex areas as it provides a
sort of lingua franca for communication 
among those who conform to the implied 
standards. 

2. Guidelines can help codify ex-
isting knowledge. Codification is equally
important; with the passage of time cer-
tain kinds of knowledge erode from the
collective consciousness unless they are pre-
served in some formal way.

3. Guidelines can help raise consumer ex- 
pectations. Raising the minimum accept- 
able standard seems particularly impor- 
tant with respect to statistical software. 
There is a great deal of very bad software; 
providing users a sort of guaranteed lower 
bound on the quality seems highly desir- 
able. 

4. Guidelines can provide some long-term 
stability in a rapidly changing environ- 
ment. Upward compatibility is invalu- 
able in the NEC; it is less clear what 
value such compatibility might have in 
statistical software. 